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Abstract 

Importance sampling algorithms are discussed in detail, with an emphasis on implicit sampling, 
and applied to data assimilation via particle filters. Implicit sampling makes it possible to use the 
data to find high-probability samples at relatively low cost, making the assimilation more efficient. 
A new analysis of the feasibility of data assimilation is presented, showing in detail why feasibility 
depends on the Frobenius norm of the covariance matrix of the noise and not on the number 
of variables. A discussion of the convergence of particular particle filters follows. A major open 
problem in numerical data assimilation is the determination of appropriate priors; a progress report 
on recent work on this problem is given. The analysis highlights the need for a careful attention 
both to the data and to the physics in data assimilation problems. 

1 Introduction 

Bayesian methods for estimating parameters and states in complex systems are widely used in sci¬ 
ence and engineering; they combine a prior distribution of the quantities of interest, often generated 
by computation, with data from observations, to produce a posterior distribution from which reli¬ 
able inferences can be made. Some recent applications of these methods, for example in geophysics, 
involve more unknowns, larger data sets, and more nonlinear systems than earlier applications, and 
present new challenges in their use (see, e.g. [7| [T^[I^|43[|45| for recent reviews of Bayesian methods 
in engineering and physics). 

The emphasis in the present paper is on data assimilation via particle filters, which requires 
effective sampling. We give a preliminary discussion of importance sampling and explain that 
implicit sampling ( ^12|13[|32| is an effective importance sampling method. We then present implicit 
sampling methods for calculating a posterior distribution of the unknowns of interest, given a prior 
distribution and a distribution of the observation errors, first in a parameter estimation problem, 
then in a data assimilation problem where the prior is generated by solving stochastic differential 
equations with a given noise. Conditions for data assimilation to be feasible in principle and 
their physical interpretation are discussed (see also |llj). A linear convergence analysis for data 
assimilation methods shows that Monte Carlo methods converge for many physically meaningful 
data assimilation problems, provided that the numerical analysis is appropriate and that the size 
of the noise is small enough in the appropriate norm, even when the number of variables is very 
large. The keys to success are a careful use of data and a careful attention to correlations. 

A very important open problem in the practical application of data assimilation methods is 
the difficulty in finding suitable error models, or priors. We discuss a family of problems where 
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priors can be found, with an example. Here too a proper use of data and a careful attention to 
correlations are the keys to success. 


2 Importance sampling 

Consider the problem of sampling a given probability density function (pdf) / on the computer, 
i.e., given a probability density function (pdf) for an m-dimensional random vector, construct a 
sequence of vectors Xi, X 2 ,..., which can pass statistical independence tests, and whose histogram 
converges to the graph of /, in symbols, find Xi ~ /. We discuss this problem in the context of 
numerical quadrature. Suppose one wishes to evaluate the integral 

I = j g{y)f{y)dy, 


where g{y) is a given function and f{y) is a pdf, i.e., f{y) > 0 and f f dy = 1. The integral I equals 
E[g(x)], where x is a random variable with pdf / and E[-] denotes an expected value. This integral 
can be approximated via the law of large numbers as 

1 

I = E[g{x)] ^ 

i=l 


where Xi ^ f and M is a large integer (the number of samples we draw). The error is of the order 
of a{g{x))/M^^^, where cr(-) denotes the standard deviation of the variable in the parentheses. 
This assumes that one knows how to hnd the samples Xi, which in general is difficult. One way 
to proceed is importance sampling (see, e.g., [8,24 ): introduce an auxiliary pdf /o, often called an 
importance function, which never vanishes unless / does, and which one knows how to sample, and 
rewrite the integral as 

I = j siy)^^ foiy)dy = E [g{x*)w{x*)] 

where the pdf of x* is /o and w{x*) = f{x*)/fo{x*). The expected value can then be approximated 
by 

1 ^ 

Y,s{x;mx;) ( 1 ) 




i=l 


where X* ~ /o and the w{X*) = f{X*)/fo{X*) are the sampling weights. The error in this 
approximation is of order a{gw) (here the standard deviation is computed with respect to 

the pdf /o). The sampling weights make up for the fact that one is sampling the wrong pdf. 

Importance sampling can work well if / and /o are fairly close to each other. Suppose they 
are not (see figure j^. Then many of the samples Xi (drawn from the importance function) fall 
where / is negligible and their sampling weight is small. The computing effort used to generate 
such samples is wasted since the low-weight samples contribute little to the approximation of the 
expected value. In fact, one can define an effective number of samples as 


MeS 


M E [w^] 


so that in particular, if all the particles have weights 1/M, = M. The effective number of 

samples approximates the equivalent number of independent, unweighted samples one has after 
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Figure 1: Poor choice of importance function; the importance function (purple) is not close to the 
pdf we wish to sample (blue). 


importance sampling with M samples. Importance sampling is efficient if R is smaller than M. For 
example, suppose that you find that one weight is Rmuch larger than all the others. Then R ~ M, 
so that one has effectively one sample and this sample does not necessarily have a high probability. 
In this case, the importance sampling scheme has “collapsed”. To find /o that prevents a collapse, 
one has to know the shape of / quite well, in particular know the region where / is large. In some 
of the examples below, the whole problem is to estimate where a particular pdf is large, and it is 
not realistic to assume that this is known in advance. 

As the number of variables increases, the value of a carefully designed importance function also 
increases. This can be illustrated by an example. Suppose that / = AA(0,/m), where Im is the 
identity matrix of order m (here and below we denote a Gaussian with mean n and covariance matrix 
Q by Q)). To sample this Gaussian we pick a Gaussian importance function /o = AA(0, 
a > 1/2. The importance function has a shape similar to that of the function we wish to sample 
and /o is large where / is large (for moderate a). Nonetheless, sampling becomes increasingly 
costly as the number of components of x increases. We find that 


w{x) = exp ( ^ x'^x 




where the superscript T denotes a transpose, so that 

E [tc(x)] = 1, E [ui(x)^] = (2 (t^ — l) 

which gives 


R = 


a 


V2ct2 - 1, 

The number of samples required for a given Mgff thus grows exponentially with the number of 
variables m: 


log(M) = m log 


cr 


+ log(Meff). 


V2(t2 - F 

The situation is illustrated in figure As the number of variables increases, the cost of sampling, 
measured by the number of samples required to obtain the pre-specified number of effective samples, 
grows exponentially with the number of variables for any u > 1/2 except <7 = 1; see also the 


discussion in 35 


3 Implicit sampling 


Implicit sampling is a general numerical method for constructing effective importance functions 
32]. We write the pdf to sample as / = and, temporarily and for simplicity, assume 
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Figure 2: The number of samples required increases exponentially with the number of variables. 


that F(x) is convex. The idea is to write x as a one-to-one and onto function of an easy-to-sample 
reference variable ^ with pdf g oc were G is convex. To find this function, we first hnd the 

minima of F, G, call them (j)p = minT, (pc = minG, and define (p = (pp — (pc- Then we define a 
mapping ip : ^ ^ x such that ^ and x satisfy the equation 

Fix)-P> = G{C). ( 2 ) 


With our convexity assumptions a one-to-one and onto map exists, in fact there are many, because 
equation ([^ is underdetermined (it is a single equation that connects the m elements of Xi to 
the m elements of Hj, where m is the number of variables). To find samples Xi, X 2 ,..., first hnd 
samples Hi, H 2 ,... of and for each i = 1, 2,..., solve (§. A sample Xi obtained this way has a 
high probability (with a high probability) for the following reasons. A sample Hj of the reference 
density g is likely to lie near the minimum of G, so that the difference (G(Hj) — cpc) is likely to 
be small. By equation Q the difference {F{Xi) — cpp) is also likely to be small and the sample 
Xi is in the region where / is large. Thus, by solving (i), we map the high-probability region of 
the reference variable ^ to the high-probability region of x, so that one obtains a high-probability 
sample Xi from a high-probability sample Hj. 

The mapping ip defines the importance function implicitly by the solutions of Q: 


foixiO) = giO 



A short calculation shows that the sampling weight w{Xi) is 


w{Xi)^e-<^\J{X,)\, (3) 

where J is the Jacobian det((9x/5^). The factor 6“*^, common to all the weights, is immaterial 
here, but plays an important role further below. 

We now give an example of a map ip that makes this construction easy to implement. We 
assume here and for the rest of the paper that the reference variable ^ is the Gaussian AA(0,/m), 
where m is the dimension of x. With a Gaussian cpQ = 0 and equation ([^ becomes 

F{x) -(p= (4) 

We can hnd a map that satishes this equation by looking for solutions in a random direction 
g = i.e. use a mapping ip such that 


X = g + XLg, 


( 5 ) 
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where n = argminF is the minimizer of F, L \s a given matrix, and A is a scalar that depends 
on Substitution of the above mapping into Q gives a scalar equation in the single variable A 
(regardless of the dimension of x). The matrix L can be chosen to optimize the distribution of 
the sampling weights. For example, if L is a square root of the inverse of the Hessian of F at the 
minimum, then the algorithm is affine invariant, which is important in applications with multiple 
scale s [^ . Equation ([^ can be readily solved and the resulting Jacobian is easy to calculate 
(see |32| for details). 

Other constructions of suitable maps ip are presented in e.g. 
analyzed in 


12,20 and some of these are 


20| . With these constructions, often the most expensive part of the calculation is 
finding the minimum of F. Note that this needs to be done only once for each sampling problem, and 
that finding the maximum of / is an unavoidable part of any useful choice of importance function, 
explicitly or implicitly. Addressing the need for the optimization explicitly has the advantage 
that effective optimization methods can be brought to bear. Connections between optimization 
and (implicit) sampling also have been pointed out in [3,33 . In fact, an alternate derivation of 
implicit sampling can be given by starting from maximum likelihood estimation, followed by a 
randomization that removes bias and provides error estimates. 

We now relax the assumption that F is convex. If F is [/-shaped, the above construction works 
without modification. A scalar function F is [/-shaped if it is piecewise differentiable, its first 
derivative vanishes at a single point which is a minimum, F is strictly decreasing on one side of 
the minimum and strictly increasing on the other, and F{x) —>• oo as |x| —)• oo; in the general 
case, F is [/-shaped if it has a single minimum and each intersection of the graph of the function 
y = F{x) with a vertical plane through the minimum is [/-shaped in the scalar sense. If F is 
not [/-shaped, but has only one minimum, one can replace it by a [/-shaped approximation, say 
Fq, and then apply implicit sampling as above. The bias created by this approximation can be 
annulled by reweighting 12 . If there are multiple minima, one can represent / as a mixture of 


components whose logarithms are [/-shaped, and then pick as a reference pdf g a cross-product of 
a Gaussian and a discrete random variable. However, the decomposition into a suitable mixture 
can be laborious (see (4^). 


4 Beyond implicit sampling 

Implicit sampling produces a sequence of samples that lie in the high-probability domain and are 
weighted by a Jacobian. It is natural to wonder whether one can absorb the Jacobian into the map 
■0 : ^ —7- X and obtain “optimal” samples that need no weights (here and below, optimal refers to 
minimum variance of the weights, i.e. an optimal sampler is one with a constant weight function). 
If a random variable x with pdf / is a one-to-one function of a variable ^ with pdf g, then 

f{x{0)=9{0m, ( 6 ) 

where J is the Jacobian of the map p. One can obtain samples of x directly (i.e. without weights) 
by solving (©• Notable efforts in that direction can be found in [34[|38| . 

However, optimal samplers can be expensive to use; the difficulties can be demonstrated by 
the following construction. Consider a one-variable pdf / with a convex F = — log /. Find the 
maximizer z of /, with maximum Mf = f{z), and let g be the pdf of a reference variable with 
its maximizer also at z and maximum Mg = g{z). To simplify the notations, change variables so 
that z = 0. Then construct a mapping p : ^ ^ x as follows. Solve the differential equation 

dM ^ g^ 
dt /(u)’ 
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where t is the independent variable and u is a function of t, in the t-interval [ 0 ,^], with initial 
condition u = 0 when t = 0. Define the map from ^ to x by a: = m(C); then /(x) = g{i)J, where 
J = \du/dx\ evaluated at t = ^. It is easy to see that under the assumptions made, this map is 
one-to-one and onto. We have achieved optimal sampling. 

The catch is that the initial value of g{t)/f{u) is Mg/Mf, which requires that the normalization 
constants of / and g be known. In general the solution of the differential equation does not depend 
linearly on the initial value of g'(O) /f{u{0)), so that unless Mj and Mg are known, the samples are in 
the wrong place while weights to remove the resulting bias are unavailable. Analogous conclusions 
were reached in 16 for a different sampling scheme. In the special case where F = — log / and 


G = — log g are both quadratic functions the constants Mf, Mg can be easily calculated, but under 
these conditions one can find a linear map that satisfies equation Q and implicit sampling becomes 
identical to optimal sampling. The elimination of all variability in the weights in the nonlinear case 
is rarely worth the cost of computing the normalization constants. 

Note that the normalization constants are not needed for implicit sampling. The mapping Q 
for solving Q is well-defined even when the normalization constants of / and g are unknown. The 
reason is that if one multiplies the functions / or 5 by a constant, the logarithm of this constant 
is added both to F {G) and to cpp {4 >g) and cancels out. Implicit sampling simplifies equation (§ 
by assuming that the Jacobian is a constant. It produces weighted samples where the weights are 
known only up to a constant, and this indefiniteness can be removed by normalizing the weights 
so that their sum equals 1 . 


5 Bayesian parameter estimation 

We now apply implicit sampling to Bayesian estimation. For the sake of clarity we first explain 
how to use Bayesian estimation to determine physical parameters needed in the numerical modeling 
of physical processes, given noisy observations of these processes. This discussion explains how a 
prior and data combine to create a posterior proability density that is used for inference. In the 
next section we extend the methodology to data assimilation, which is our main focus. 

Suppose one wishes to model the diffusion of some material in a given medium. The density of 
the diffusing material can be described by a diffusion equation, provided the diffusion coefficients 

6 G MF can be determined. Given the coefficients 9, one can solve the equation and determine 
the values of the density at a set of points in space and time. The values of the density at these 
points can be measured experimentally, by some method with an error g whose probability density 
is assumed known; once the measurements d G are made, this assigns a probability to 9. The 
relation between d and 9 can be written as 

d = h{9)+g, ( 8 ) 

where h : —)■ M”* is a generally nonlinear function, and g ~ Prii') is a random variable with 

known pdf that represents the uncertainty in the measurements. The evaluation of h often requires 
a solution of a differential equation and can be expensive. In the Bayesian approach, one assumes 
that information about the parameters is available in form of a prior poi9). This prior and the 
likelihood p{d\9) = Pg{d — h{9)), defined by Q, are combined via Bayes’ rule to give the posterior 
pdf, which is the probability density function for the parameters 9 given the data d, 

Pi^\d) = -^Po{9)p{d\9), (9) 

7(d) 

where 7 (d) = f po(9)p(d\9)d9 is a normalization constant (which is hard to compute). 
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In a Monte Carlo approach to the determination of the parameters, one finds samples of the 
posterior pdf. These samples can, for example, be used to approximate the expected value 

E[6\d] = j ep{e\d)de, 

which is the best approximation of 0 in a least squares sense (see, e.g. [^), and a error estimate 
can also be computed, see also (4^ . 

When sampling the posterior p{9\d), it is natural to set the importance function equal to the 
prior probability po{6) and then weight the samples by the likelihood p{d\0). However, if the data 
are informative, i.e., if the measurements d modify the prior significantly, then the prior and the 
posterior can be very different and this importance sampling scheme is ineffective. When implicit 
sampling is used to sample the posterior, then the data are taken into account already when 
generating the samples, and not only in the weights of the samples. 

As the number of variables increases, the prior po and the posterior p{6\d) may become nearly 
mutually singular. In fact, in most practical problems these pdfs are negligible outside a small 
spherical domain so that the odds that the prior and the posterior have a significant overlap 
decrease quickly with the the number of variables, making implicit sampling increasingly useful 
(see for an application of implicit sampling to parameter estimation). 


6 Data assimilation 


We now apply Bayesian estimation via implicit sampling to data assimilation, in which one makes 
inferences from an unreliable time-dependent approximate model that defines a prior, supplemented 
by a stream of noisy and/or partial data. We assume here that the model is a discrete recursion of 
the form 

+ ( 10 ) 

where n is a discrete time, n = 0, 1, 2 ,..., the set of m-dimensional vectors = {x^,x ^,. . . , x"', . . . ) 
are the state vectors we wish to estimate, /(•) is a given function, and tc"' is a Gaussian random 
vector N'{0,Q). The model often represents a discretization of a stochastic differential equation. 
We assume that /c-component data vectors bn, n = 1,2,... are related to the states x” by 


IT = h{x^) + rj^, 


( 11 ) 


where /i(-) is the (generally nonlinear) observation function and rj^ is a J\f{0,R) Gaussian vector 
(the r/” and are independent of rj^ and for k < n and also independent of each other). 
In addition, initial data x*^, which may be random, are given at time n = 0. For simplicity, we 
consider in this section the problem of estimating the state x of the system rather than estimating 
parameters in the equations as in the previous section. Thus, we wish to sample the conditional 
pdf p(x°'”| 6 ^'"'), which describes the probability of the sequence of states between 0 and n given the 
data 6 ^'”' = {b ^,..., h^). The samples of this conditional pdf are sequences , X^,X ‘^,..., usually 
referred to as “particles”. The conditional pdf satisfies the recursion 


I /Ln+1 1 „n+l 

(x0:n+l|^l:n+l) ^ (^0:n|^l:n)P(^- f 

I / I J p(5n+l|5l:n) 


( 12 ) 


where p(x”''''^|x”) is determined by equation ( 10 ) and p{b"'^^\x"'^^) is determined by equation ([ll|). 


(see e.g. H^). We wish to sample the conditional pdf recursively, time step after time step, which 
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is natural in problems where the data are obtained sequentially. To do this we use an importance 
function vr of the form 


n+1 




(13) 


k=l 


where the vr^ are increments of the importance function. The recursion (12) and the factorization 
(13) yield a recursion for the weight of the j-th particle, 


p0:n+l|^l:n+l 


VT -+1 = ^ _ 

] ^( 3 , 0 :n+l| 5 l:n+l) 


^ 7r„+i(X"+i|X0.-,6i-) 


(14) 


With this recursion, the task at hand is to pick a suitable update 'Kk{x^\x^-^ 6^'^) for each particle, 

to sample p{X^^^\X'^)p{h'^^^\X^^^), and to update the weights so that the pdf 


is well-represented. Thus, the solution of the discrete model (10) plays the same role in data 


assimilation as the prior in the parameter estimation problem of section 5. 
In the SIR (sequential importance sampling) filter one picks 


'^n+l {x 




X 




J ' 3 

and the weight is oc p{lP~^^\X^^^). Thus, the SIR filter proposes samples by the 


model (10) and determines their probability from the data (11). 

In implicit data assimilation one samples 7r„+i(x”'’'^|x®^5^'”'+^) by implicit sampling, thus 
taking the most recent data into account for generating the samples. The weight of each 


particle is given by equation (14) 


wn+i ^ vlTexp(-(/)”+i)J(X;+^), 

where = argmin is the minimum of 

^n+l^^n+l) = _log {p{x^+^\X])p{h^+^\x'^+^)) . 

Thus, a minimization is required for each particle at each step. 

The “optimal” importance function 16,26,^ is defined by @ with 


and its weights are 




= Wf p{V^+^\X^). 


'3 '3 

This choice of importance functions is optimal in the sense that it minimizes the variance of the 
weights at the n -|- 1 step given the data and the particle positions at the previous step. In fact. 


this importance function is optimal over all importance functions that factorize as in (13), and in 
the sense that the variance of the weights var(rc 
respect to 7r(x°-’^+^|6°-’^+^)), see 


42 


implicit filter and the optimal filter coincide 13,29 32 


is minimized (with expectations computed with 
In a linear problem where all the noises are Gaussian, the 
When a problem is nonlinear, the optimal 


filter may be hard to implement, as discussed above. 

A variety of other constructions is available (see e.g. [1,46-^). The SIR and optimal filters 
are the two extreme cases, in the sense that one of them makes no use of the data in finding 
samples while the other makes maximum use of the data. The optimal filter becomes impractical 
for nonlinear problems while the implicit filter can be implemented at a reasonable cost. Implicit 
sampling thus balances the use of data in sampling and the computational cost. 
























7 The size of a covariance matrix and the feasibility of data as¬ 
similation 


Particle filters do not necessarily succeed in assimilating data. There are many possible causes- for 


example, the recursion (10) may be unstable, or the data may be inconsistent. The most frequent 


cause of failure is filter collapse, in which only one particle has a non-negligible weight, so that there 
is effectively only one particle, which is not sufficient for valid inferences. In the next section we 
analyze how particle collapse happens and how to avoid it. In preparation for this discussion, we 
discuss here the Frobenius norm of a covariance matrix, its geometrical significance, and physical 
interpretation. 

The effectiveness of a sampling algorithm depends on the pdf that it samples. For example, 
suppose that the posterior you want to sample is in one variable, but has a large variance. Then 
there is a large uncertainty in the state even after the data are taken into account, so that not 
much can be said about the state. We wish to assess the conditions under which the posterior can 
be used to draw reliable conclusions about the state, i.e. we make the statement “the uncertainty is 
not too large” quantitative. We call sampling problems with a small enough uncertainty “feasible 
in principle”. There is no reason to attempt to solve a sampling problem that is not feasible in 
principle. 

Our analysis is linear and Gaussian and we allow the number of variables to be large, in 
multivariate Gaussian problems, feasibility requires that a suitable norm of the covariance matrix 
be small enough. This analysis is inspired by geophysical applications where the number of variables 


is large, but the nonlinearity may be mild; parts of it were presented in 11 


We describe the size of the uncertainty in multivariate problems by the size the mxm covariance 
matrix P = (pij), which we measure by its Frobenius norm 

1/2 

ipiif= lEErf/l =i>hn . (15) 


i=l j=l 


' m 

\k=l 


where the Afc, k = 1,... ,m are the eigenvalues of P. The reason for this choice is the geometric 
significance of the Frobenius norm. Let x ~ W(/r, P) be an m-dimensional Gaussian variable, and 
consider the distance between a sample of x and its most likely value p 

r = \\x- ii\\2 = ^Jix- n)'^{x- fi). 

If all eigenvalues of P are 0(1) and if m is large, then E[r], the expected value of r, is 0(||P||i?) 
and var{r) = 0(1), i.e. the samples concentrate on a thin shell, far from their most likely value. 
Different parts of the shell may correspond to physically different scenarios (such as “rain” or “no 
rain”). Moreover, the expected value and variance of the distance distance r = y/y of a sample to 
its most likely value are bounded above and below by multiples of ||P||f (see [^). If one tries to 
estimate the state of a system with a large ||P||i;’, the Euclidean distance between a sample and 
the resulting estimate is likely to be large, making the estimate useless. For a feasible problem we 
thus require that ||P||f not be too large. How large “large” can be depends on the problem. 

In and in earlier work on the feasibility of particle filters, the Frobenius norm of 

P was called the “effective dimension”. This terminology is confusing and we abandon it here. We 
show below that ||P||f quantihes the strength of the noise. We define the effective dimension of 
the noise by 

l oo 

min£ G Z ; ^A2 >(l-e)^A2 , (16) 

i=i /=i 
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where e is a predetermined small parameter. A small ||i^||F can imply a small i, however the 
reverse is not necessarily true and can be small, even though £ is large (for example if 

X ~ AA(0,/m/\/w)) and vice versa. There are small dimensional problems with a large noise (large 
||P||i;’) which are not feasible in principle, and there are large dimensional problems with a small 
noise which are feasible in principle (small ||P||i7’). Estimation based on a posterior pdf is feasible 
in principle only if ||P||f is small. 

We now examine the relations between ||P||j7’, the effective dimension £, and the correlations 
between the components of the noise (i.e., the size of the off-diagonal terms in P). We assume 
that our random variables are point values Xi of a stationary stochastic process on the real line, 
measured at the points ih of the finite interval [0,1], where i = l,2 ,...,m and h = 1/m. As 
the mesh is refined, m increases. The condition for the discrete problem to be feasible is that the 
Frobenius norm of its covariance matrix be bounded. Let the covariance matrix of the continuous 
process be k{x,P) = k{x — P); the corresponding covariance operator is defined by 

Kip = / k{x,x')p{x')dx 

Jo 

for every function p = p{x). The eigenvalues of K can be approximated by the eigenvalues of P 
multiplied by h (see [^). The Frobenius norm of K, 

1 

k'^ dx dx', 

describes the distance of a sample to its most likely value in the sense, as can be seen by 
following the same steps as in [11] , replacing the Euclidean norm by the appropriate grid-norm, 

II 112 1 2 

i-e- jja^lb = 22k=ihXk. 

We consider a family of Gaussian stationary processes with differing correlation structures. The 
discussion is simplified if we keep the energy e defined by 

/ OO 

k{y,y'fdy'. 

-OO 

equal to 1 for all the members of the family (so that all the members of the family are equally 
noisy); that e is an energy follows from Khinchin’s theorem (see e.g. [^). An example of such a 
family is the family of zero-mean processes with 

. _i _i 

k[x, X ) = TT iL 2 exp 


^ jx-x'y ^ 



where L is the correlation length. The infinite limits in the definition of e make the calculations 
easier, and are harmless as long as L is less than about 1/2. The elements pij of the discrete matrix 
P are pij = k{ih,jh), i, j = 1,... ,m. 

In the left panel of figure we demonstrate that for a fixed correlation length L = 0.1, the 
Frobenius norm 

( m 

k=l 



where the Xk are the eigenvalues of the covariance matrix P, is independent of the mesh (for small 
enough h) and, therefore independent of the discretization. In the calculation of the dimension 
a. in (16), we use e = 5%. In the figure, we plot the Frobenius norms of the m x m matrices P 
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fixed con'elation length L 


dimension I 


_ 

Frobenius norm 


1o’ 10^ 10^ 

number of variables m = llh 



Figure 3: Norms of covariance matrices and dimension i for a family of Gaussian processes with 
constant energy as a function of the mesh size h (left) and as a function of the correlation length 
L (right). 


as the purple and red) dashed lines. The solid line is the Frobenius of the covariance operator K 
of the continuous process, which, for this simple example, can be computed analytically from the 
expansion 

OO 

Ky^y') = '^^jej{y)ej{y'), 
i=i 

where A is an eigenvalue of k{y,y') with eigenvector e{y). The eigenvalues and eigenvectors are 


defined by 

/ k{y^y')<^iy) dy = Xe{y'), 


Jo 

and 

fZZise. 

Thus, 



j j ^(2/,2/)^ dW = ^ = ttL l^^exp + V^Erf 

Since A ~ /lA, we have that 


We find good agreement between the infinite dimensional results and their finite dimensional ap¬ 
proximations (the dashed lines are mostly invisible because they coincide with the results calculated 
from the infinite dimensional problem). 

The right panel of the figure shows the variation of ||T’||f and the dimension t with the corre¬ 
lation length L. We observe that the Frobenius norm remains almost constant for L < 10“^. On 
the other hand, the dimension i in (16) increases as the correlation length decreases. What this 
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figure shows is that the feasibility of data assimilation depends on the level of noise, but not on 
the effective number of variables. One can find the same level of noise for very different effective 
dimensions. If data assimilation is not feasible, it is because the problem has too much noise, not 
too many variables. 

It is interesting to consider the limit L —)• 0 in our family of processes. A stationary process 
u{x) such that for every x, u{x) is a Gaussian variable, with E[u{x)u{x')] = 0 for x 7 ^ x' and 
E[u{x)‘^] = A, where A is a finite constant, has very little energy; white noise, the most widely used 
Gaussian process with independent values, has E[u{x)u{x')] = 6x^x', where the right-hand-side is a 
5 function, so that E[u{x)‘^] is unbounded. The energy of white noise is infinite, as is well-known. 
If k{x — x') in our family blew up like L~^ at the origin as L —0, the process would be a multiple 
of Brownian motion; here it blows up like allowing the energy to remain finite while not 

allowing the to remain bounded. The moral of this discussion is that a sampling problem 

where P = Im for all m is unphysical. 

An apparent paradox appears when one applies our theory to determine the feasibility of a 
Gaussian random variable with covariance matrix P = /m, as in (4|[^ |40f(4^ . In this case, HPHf = 
\/m, so that the problem is infeasible by our definition if the number of variables m is large. On 
the other hand, the problem seems to be physically plausible. For example, suppose one wishes to 
estimate the wind velocity at m geographical sites based on independent local models for each site 
(e.g. today’s wind velocity is yesterday’s wind velocity with some uncertainty), and independent 
noisy measurements at each site. The local Pj at each site is one-dimensional and equals 1. Why 
then not try to determine the m velocities in one fell swoop, by considering the whole vector x 
and setting P = Im"! Intuition suggest that the set of local problems is equivalent to the P = Im 
problem, e.g. that one can use local stochastic models to predict the velocities at nearby sites, 
and then mark these on a weather map and obtain a plausible global field. Thus, HT’Hf is large in 
a plausible and feasible problem. This, however, is not so. First, in reality, the uncertainties in the 
velocities at nearby sites are highly correlated, while the problem P = Im assumes that this is not 
so. The resulting velocities map from the latter will be unphysical and lead to wrong forecasts- large 
scale coherent flows in today’s weather map will have a different impact on tomorrow’s forecast 
than a set of uncorrelated wind patterns, and of course carry a much larger energy, even if the 
local amplitudes are the same. If one has a global problem where the covariance matrix is truly 
Im, then replacing the solution of the full problem by a component-wise solution changes estimate 
of the noise from HPHf to the numerically smaller maximum of the component variances, which is 
not as good a measure of the distance between the samples and their mean. 

8 Linear analysis of the convergence of data assimilation 

We now summarize the linear analysis of the conditions under which particular particle filters 
produce reliable estimates when the problem is linear (see [^). A general nonlinear analysis is not 
within reach, while the linear analysis captures the main issues. 

The dynamical equation now takes the form: 

x^+i = Ax" + re", (17) 

where A is an m x m matrix and, the data equation becomes 

= 77x"+i + r/", (18) 

where H is an k x m matrix. As before, re" are independent AA(0, Q) and the r/" are independent 
Af{0, R), independent also of the tc". We assume that equation 0 is stable and the data are suffi¬ 
cient for assimilation (for a more technical discussion of these assumptions, see e.g. [11]). These two 
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equations define a posterior probability density, independently of any method of estimation. The 
first question is, under what conditions is state estimation with the posterior feasible in principle. 
If one wants to estimate the state at time n, given the data up to time n, this requires that the 
covariance of x^\b^''"' have a small Frobenius norm. 

The theoretical analysis of the Kalman filter |23| can be used to estimate this covariance, even 
when the problem is not feasible in principle or the Kalman filter itself is too expensive for practical 
use. Under wide conditions, the covariance of rapidly reaches a steady state 

P = {I - KH)X, 


where X is the unique positive semi-definite solution of a particular nonlinear algebraic Riccati 
equation (see e.g. [^) and 

K = XH^{HXH'^ + R)-\ 


One can calculate HRHf and decide whether the estimation problem is feasible in principle. Thus, 
a condition for successful data assimilation is that HRHf be moderate, which generaly requires that 
the IIQIIf) ||-R||f in equations (0 and ( |18[ ) be small enough. 

A particle filter can be used to estimate the state by sampling the posterior pdf , 

defined by 0 and ( [T^ . The state at time n conditioned on the data up to time n can be computed 
by marginalizing samples of p{x^''^~^^\b^'''^~^^) to samples of p{x'^\b^'''^~^^) (which amounts to simply 
dropping the sample’s past). Thus, a particle filter does not directly sample p{x'^\b^''^~^^), so that 
its samples carry weights (even in linear problems). These weights must not vary too much or else 
the filter “collapses” (see section 2). Thus, a particle filter can fail, even if the estimation problem 
is feasible in principle. 

It was shown in [^[6 ,11,40, ^ that the variance of the negative logarithm of the weight must 
be small or else the filter collapses. For the SIR filter, the condition that this collapse not happen 
is that the Frobenius norm of the matrix 


^sik = H{Q + APA^)H^R-\ 


(19) 


be small enough. For the optimal filter, which in the present linear setting coincides with the 
implicit filter, success requires a small Frobenius norm for 

Sopt = HAPA^H^iHQH^ + R)~\ (20) 

see [11] . In either case, this additional condition must be satisfied as well as the the condition that 
||P||f is small. 

To understand what these formulas say, we apply them to a simple model problem which is 
popular as a test problem in the analysis of numerical weather prediction schemes. The problem 
is defined hy P[ = A = Im and Q = qlm, R = rim, where we vary the number of components m. 
Note that for a fixed m, the problem is parameterized by the variance r of the noise in the model 
and the variance q of the noise in the data. This problem is feasible in principle if 




q"^ + ^qr — q 


is not too large. For a fixed m, this means that 


\/q^ + 4:qr - q 


< 1 , 
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Data assimilation infeasible 


■5 

c= 



Figure 4: Conditions for feasible data assimilation and for successful sampling with the optimal 
particle filter, the particle filter and the ensemble Kalman filter (from [31]). 


where the number 1 stands in for a sharper estimate of the acceptable variance for a given problem 
(and this choice gives the complete qualitative story). In figure this condition is illustrated and 
shown are all feasible problems in white and all infeasible problems in grey. The analysis thus 
shows that for data assimilation to be feasible, either the model or the data must be accurate. If 
both are noisy, the noise dominates so that estimation is useless. 

In the SIR filter, the additional condition for success becomes 


| q 


2r 


< 1 , 


and for the optimal/implicit case, the additional condition is 


\/+ Aqr - q 
2{q + r) 


< 1 . 


( 21 ) 


In both cases, the number 1 on the right-hand-side stands in for a sharper estimate of the acceptable 
variance of the weights for a given problem (which also depends on the computing resources). In 
either case, the added condition is quadratic and homeogeneous in the ratio q/r, and thus slices out 
conical regions from the region where data assimilation is feasible in principle. These conical regions 
are shown in figure [^ The region where estimation with a particular filter is feasible is labeled with 
the name of the particular filter. Note that we also show results for the ensemble Kalman filter 
(EnKF) (Bl which are derived in [^, but not discussed in the present paper. The analysis 


of the EnKF relates to the situation where it is used to sample the same posterior density as the 
other filters quoted in the figure. The analysis in |31j also includes the situation where the EnKF 
is used to sample a marginal of that pdf directly, when the conclusions are different. 

In practical problems one usually tries to sharpen estimates obtained from approximate dy¬ 
namics with the help of accurate measurements, and the optimal/implicit filter works well in such 
problems. However, there is a region in which not even the optimal filter succeeds. What fails 
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there is the sequential approach to the estimation problem, i.e. the factorization of the importance 
function as in (13), see 31,42 . Non-recursive filters can succeed in this region, and there too 


implicit sampling can be helpful i.e. one can apply implicit sampling direction to 

One conclusion we reach from our analysis is that if Bayesian estimation is feasible in principle 
then it is feasible in practice. However, it is assumed in the previous analysis so far that one has 
suitable priors, or equivalently, one knows what the noise in equation (10) really is. The fact is 
that generally one does not. We discuss this issue in the next section. 


9 Estimating the prior 

The previous sections described how one may use a prior and a distribution of observation errors to 
estimate the states or the parameters of a system. This estimation depends on knowing the prior. 
As we have seen, in data assimilation, the prior is generated by the solution of an equation such 
as the recursion (10) and depends on knowing the noise rc" in that equation. It is often assumed 


that the noise in that equation is white. However, one can show that the noise is not white in most 
problems (see e.g. [9,14,^). We now present a preliminary discussion of methods for estimating 
the noise. 

First, one has to make some assumptions about the origin of the noise. A reasonable assumption 
(see e.g. (^[^) is that there is noise in the equations of motion because a practical description of 
nature is necessarily incomplete. For example, one can write a solvable equation of motion for 
a satellite turning around the earth by assuming that the gravitational pull is due to a perfectly 
spherical earth with a density that is a function only of distance from the center (see [^). Reality is 
different, and the difference produces noise, also known as model error. The problems to be solved 
are: (i) estimate this difference, (ii) identify it, i.e., find a concise approximate representation of 
this difference that can be effectively evaluated or sampled on a computer, and (in) design an 
algorithm that imbeds the identified noise in a practical data assimilation scheme. These problems 
have been discussed in a number of recent papers, e.g. [9,28 
which contains an extensive bibliography. 


in particular the review paper |2^ 


Assume that the full description of a physical system has the form; 

= R{x, y), = S{x, y), 


( 22 ) 


where t is the time, x = (xi, X 2 , • •., Xm) is the vector of resolved variables, and y = {yi,y 2 , ■ ■ ■ ,yi) 
is the vector of unresolved variables, with initial data x(0),y(0). Consider a situation where this 
system is too complicated to solve, but where data are available, usually as sequences of measured 
values of x, assumed here to be observed with negligible observation errors. Write R{x, y) in the 
form 

R{x,y) = Ro{x) + z{x,y), (23) 


where Rq is a practical approximation of R{x, y) and one is able to solve the equation 

i-x = mx). 


(24) 


However, x does not satisfy equation (24) because the true equation, the first of equations (22), is 


more complicated. The difference between the true equation and the approximate equation is the 
remainder z{x, y) = i2(x, y) — Ro{x), which is the noise in the determination of x. In general, 2 has 
to be determined from the data, i.e., from the observed values of x. 
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A usual approach to the problem of estimating z is to use equation (23) to obtain its values from 
X data, i.e. calculate z = — Ro{x), and then identify it as a continuous stochastic process. This 

may be difficult, in particular, calculating z requires that one differentiate x, which is generally 
impractical or inaccurate because z may have high-frequency components or fail to be sufficiently 
smooth, and the data may not be available at sufficiently small time intervals (an illuminating 
analysis in a special case can be found in [^[^). Once one has values of z, identifying it as 
a function of the continuous variable t often requires making unwarranted assumptions on the 
small-scale structure of z, which may be unknown when the data are available only at discrete 
times. 

An alternative is supplied by a discrete approach as in 


10 . Equation (24) is always solved on 


the computer, i.e., in discrete form, the data are always given at discrete points, and it is x one 
wishes to determine but in general one is not interested in determining z per se. We can therefore 
avoid the difficult detour through a continuous-time z followed by a discretization, as follows. We 


pick once and for all a particular discretization of equation (24) with a particular time step time 
step 6, 

x 


= x^ + 6Rsix^), (25) 


where Rs is obtained, for example, from a Runge-Kutta method, and where n indexes the result 
after n steps; the differential equation has been reduced to a recursion such as equation (10). We 
then use the data to identify the discrepancy sequence, = (x"'+^ — x"’)/<5 — Rs{x^), which is 
available from x data without approximation. 

Then assume, as one does in the continuous-time case, that the system under consideration is 
ergodic, so that its long-time statistics are stationary. The sequence z^ becomes a stationary time 
series, which can be represented by one of the representations of time series, e.g. the NARMAX 
(nonlinear auto-regression moving average with exogenous inputs) representation, with x as an 
exogenous input. The observed x of course may include observation errors, which have to be 
separated from the model noise via some version of filtering. 

As an example, we applied this construction to the Lorenz 96 system 


27 , created to serve as 


a metaphor for the atmosphere, which has been widely used as a test bench for various reduction 
and filtering schemes. It consists of a set of chaotic differential equations, which, following [18] 
write as: 


we 


Xk 


= Xk-i (Xfc+i - Xk- 2 ) - Xk + F + Zk, 


with 


dt 

~^yj,k — '^[yj+i,kiyj—i,k yj+2,k) yj,k T hyXk] 


hx 

Zk = -j 2_^ yj,kj 


(26) 


and k = 1,.. ., AT, j = 1,..., J. The indices are cyclic, Xk = Xk+K, yj,k = yj,k+K and yj+j^k = 
Vj^k+i- This system is invariant under spatial translations, and the statistical properties are identical 
for all Xk- The parameter e measures the time-scale separation between the resolved variables Xk 
and the unresolved variables yj^k- The parameters are e = 0.5, AT = 18, J = 20, A = lQ,hx = —1 
and hy = 1. The ergodicity of the Lorenz 96 system has been established numerically in earlier 
work (see e.g. [iS]). We pretend that this system is too difficult to integrate in full; we take as as 
Ro of equation ( |24[ ) the system is which is Zk = 0. The noise is then Zk, which in our special case 
can actually be calculated by solving the full system of equations; in general the noise has to be 

In figure we plot the covariance function 


estimated from data as described above and in 10,50 
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Figure 5: Autocorrelation function of the noise z in the Lorenz 96 system. 


of the noise component zi determined as just described. Note that z cannot be thought of white 
noise even remotely. To the extent that the Lorenz 96 is a valid metaphor for the atmosphere, we 
find that the noise in a realistic problem is not white. This can of course be deduced from the 
construction after equations (22), where the noise appears as the difference between solutions of 
differential equations, which is not likely to be white. 

This relation between the preceding discussion of the noise (which defines the prior through 
equation (23)) should be compared with the discussion of implicit sampling and particle filters. Here 
too the data have been put to additional use (there to define samples and not only to weight them, 
here to provide information about the noise as well as about the signal); here too an assumption 
of a white noise input has been found wanting, and realism requires non-trivial correlations, (there 
in space, here in time). 

The next step is to combine the recursion ( |25[ ) with observations to produce a state estimate. 
Recent work on this topic is summarized in [22[ . An example of this construction could not be 
produced in time to appear in this article. 


10 Conclusions 

We have presented algorithms for data assimilation and for estimating the prior, with some analysis. 
The work presented shows that the keys to success are a better use of the data and a careful analysis 
of what is physically plausible and useful. We feel that significant progress has been made. One 
conclusion we draw from this work is that data assimilation, which is often considered as a problem 
in statistics, should also, or even mainly, be viewed as a problem in computational physics. This 
conclusion has also been reached by others, see e.g. [I4|[^ . 
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